Conversation

@shlomi-noach
Contributor

Storyline: #205

WORK IN PROGRESS: resurrecting a migration after failure.
The idea is that gh-ost would routinely dump migration status/context. It would be possible for one gh-ost process to fail (e.g. having met critical-load) and for another gh-ost process to pick up from where the first left off.
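The idea can be sketched as a round-trip of the migration context (a toy model, not gh-ost's actual code; the field names below are hypothetical and far simpler than the real context):

```python
import json

# Hypothetical, simplified migration context; the real one has many more fields.
context = {
    "database": "test",
    "table": "sample",
    "iteration": 5000,
    "applied_binlog_file": "mysql-bin.000120",
    "applied_binlog_pos": 123456,
}

def export_context(ctx):
    """Serialize the migration context so a successor process can resume from it."""
    return json.dumps(ctx, sort_keys=True)

def import_context(dump):
    """Reconstruct the context inside a resurrecting process."""
    return json.loads(dump)

dump = export_context(context)
assert import_context(dump) == context  # a new process sees the same state
```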

Initial commits present exporting of migration context, with some shuffling & cleanup.

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 20, 2016

TODO:

  • must not export passwords
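One way to honor that TODO is to strip credential fields before serializing; a resurrecting process would then have to be handed the password again on its own command line. A minimal sketch (key names are assumptions, not gh-ost's real field names):

```python
import json

SENSITIVE_KEYS = {"password", "cli_password"}  # hypothetical credential fields

def export_context(ctx):
    """Export everything except credentials, so the dump is safe to persist
    in the changelog table."""
    safe = {k: v for k, v in ctx.items() if k not in SENSITIVE_KEYS}
    return json.dumps(safe, sort_keys=True)

ctx = {"user": "gh-ost", "password": "s3cret", "table": "sample"}
dump = export_context(ctx)
assert "s3cret" not in dump
```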

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 20, 2016

Export is to the changelog table. This ensures atomicity and durability of the write, assuming the changelog table is InnoDB. Notably, if the migrated table is MyISAM, so is the changelog table.

I'm fine stating that resurrection does not work on MyISAM, because MyISAM.
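The write itself can be sketched as a single INSERT into the changelog table; the `_<table>_ghc` name follows gh-ost's changelog naming convention, but the (hint, value) column layout here is a simplified assumption:

```python
def changelog_insert(table, hint, value):
    """Compose the INSERT that persists a context dump into the changelog
    table. As a single statement against an InnoDB table, the write is
    atomic and durable."""
    changelog = f"_{table}_ghc"  # gh-ost's changelog table naming
    return (
        f"insert /* gh-ost */ into `{changelog}` (hint, value) values (%s, %s)",
        (hint, value),
    )

sql, args = changelog_insert("sample", "context", '{"iteration": 5000}')
assert "_sample_ghc" in sql
assert args == ("context", '{"iteration": 5000}')
```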

@shlomi-noach
Contributor Author

5f25f74 makes for something that works! I'll need to iterate to see what has been overlooked, but basically we're getting there fast.

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 23, 2016

A concern is not to rely on the streamer's last known position, because the streamer writes to a buffer (currently hard-coded to 100 events). Those buffered-but-unapplied events would be lost upon resurrection.

  • Instead, we should have the migration report the last applied event's coordinates.
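The streamed-vs-applied distinction can be illustrated with a toy buffer (sizes and coordinates here are illustrative only):

```python
from collections import deque

BUFFER_SIZE = 100  # events the streamer may hold that are not yet applied

streamed_coordinates = None   # last event read from the binary log
applied_coordinates = None    # last event actually applied to the ghost table
buffer = deque(maxlen=BUFFER_SIZE)

def stream_event(coords):
    """The streamer enqueues an event; persisting this position is unsafe."""
    global streamed_coordinates
    buffer.append(coords)
    streamed_coordinates = coords

def apply_next_event():
    """The applier dequeues and handles an event; this position is safe to persist."""
    global applied_coordinates
    coords = buffer.popleft()
    # ... apply the event to the ghost table here ...
    applied_coordinates = coords

for pos in range(1, 6):
    stream_event(("mysql-bin.000120", pos))
apply_next_event()
apply_next_event()

# Crashing now and resuming from streamed_coordinates would skip events 3..5.
assert streamed_coordinates == ("mysql-bin.000120", 5)
assert applied_coordinates == ("mysql-bin.000120", 2)
```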

Shlomi Noach added 6 commits December 23, 2016 15:24
StreamerBinlogCoordinates -> AppliedBinlogCoordinates
updating AppliedBinlogCoordinates when truly applied; no longer asking streamer for coordinates (because streamer's events can be queued, but not handled, a crash implies we need to look at the last _handled_ event, not the last _streamed_ event)
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 24, 2016 17:58 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001tb December 27, 2016 06:10 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 11:36 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 12:24 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 21:06 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 21:17 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 29, 2016 08:24 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 29, 2016 08:27 Active
@shlomi-noach shlomi-noach mentioned this pull request Dec 29, 2016
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 30, 2016 06:03 Active
@tomkrouper

Off-issue, you were mentioning gh-ost was having checksum issues with resurrections. When you mentioned that I was thinking: could it be related to the fact that we have two things going on, the backlog and the iteration of inserts? I hope that makes sense. Nonetheless, it was something that popped into my head that I hoped might help when you get back to this. (Not fully understanding the code changes, this might already be something you're handling.)

@shlomi-noach
Contributor Author

@tomkrouper the conjecture is as follows:

  • assuming gh-ost breaks while copying rows 5,000-5,100
  • and while reading mysql-bin.000120 at position 123456

it should be OK to resume execution

  • start with copying rows 4,300-4,400 (way before the point of breakage)
  • start with reading binary log mysql-bin.000120 at position 121234 (way before the point of breakage)

this is the conjecture's logic:

  • re-copying the same rows just overwrites existing rows (or adds rows that weren't there before!)
  • re-applying the binary logs is an idempotent action

I find it a bit difficult right now to substantiate these claims, but I believe them to be true.
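A toy model of the first claim, treating the ghost table as a keyed map where both row-copy and binlog apply upsert by primary key (this is a sketch of the idea, not gh-ost's actual SQL):

```python
# Model the ghost table as {pk: row}. Re-running a range is safe because
# writing the same row under the same primary key is a no-op overwrite.
def copy_rows(ghost, source, lo, hi):
    for pk in range(lo, hi + 1):
        if pk in source:
            ghost[pk] = source[pk]  # overwrite of an identical row is harmless

source = {pk: f"row-{pk}" for pk in range(4300, 5101)}
ghost = {}

copy_rows(ghost, source, 4300, 5100)  # first attempt; suppose we crash after this
copy_rows(ghost, source, 4300, 4400)  # resume from way before the breakage
copy_rows(ghost, source, 4400, 5100)  # continue; overlap does no harm

assert ghost == {pk: source[pk] for pk in range(4300, 5101)}
```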

But then, of course, tests are failing...

@Xopherus

@shlomi-noach are there any plans to revisit this feature? I'm looking at gh-ost again and one of the concerns my team has is that we have some very large tables that can take days, if not a week, to copy. If the process were to crash in the middle, we'd have a lot of wasted effort, especially since we have to slowly drain the _gho table to prevent the dreaded global metadata lock when dropping it. This would be extremely helpful for us!

@shlomi-noach
Contributor Author

@Xopherus this isn't on the near-term roadmap.
FWIW, we likewise run week-long, or in one case even 22-day-long, migrations. We use

-critical-load-hibernate-seconds=3600

such that hitting critical-load doesn't cause gh-ost to bail out.

I understand the stress involved with running a week long migration. Our history shows those migrations do not break, hence the Resurrection feature is not urgent for us to implement.

@Xopherus

Xopherus commented Jul 7, 2018

Thanks for the advice @shlomi-noach! Appreciate the wisdom - I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long. I'll have to try that parameter and let you know how it goes.

@shlomi-noach
Contributor Author

I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long

@Xopherus Could you please elaborate on that? I'm not sure I understand.

@Xopherus

Oh I just mean that if your migrations can take multiple hours or days, it can be tricky to tune parameters (e.g. critical load thresholds or lock cutover timeouts) because it takes longer to experiment. Fortunately we've gotten solid advice from you and others here to help guide us in the right direction.

@daniel-nichter

Hi @shlomi-noach :-) I think "resurrect" is not the best term. It's not a standard technical term. Even doc/command-line-flags.md has to clarify: "It is possible to resurrect/resume a failed migration". When people think, "Can I resume an osc?", they'll look for and Google with that term. Imho, "resurrect" will never cross people's minds. By contrast, everyone knows what "resume" (and its reciprocals "suspend" or "pause") means. I'd also argue that it's not technically descriptive or intention-revealing. A dead body can be resurrected, and I get the joke with the app being called "ghost", but it raises the question: what does it mean to resurrect a program? My last argument: for non-native English speakers/readers, these issues are compounded by uncommon words in a technical context.

I'd vote for pause/resume or start/stop.

@shlomi-noach
Contributor Author

Thank you @daniel-nichter

@rakhi-s

rakhi-s commented Sep 9, 2020

Bumping this feature request to check whether there have been any changes to make this feature available?

@tomkrouper

Bumping this feature request to check whether there have been any changes to make this feature available?

This code is fairly old and there are a bunch of conflicting files at this point. We don't have any immediate plans to work on this, but I do agree this would be a good feature to have and if anyone would like to continue the work, we'd love the community contribution.

@meiji163
Contributor

superseded by #1595

@meiji163 meiji163 closed this Jan 12, 2026